Abstract

LTR retrotransposons (LTR-RTs) are mobile genetic elements constituting remarkable portions of plant genomes and significantly contribute to genome structure, size and regulations. Some LTR-RTs lineages are distributed uniformly while others are localized in centromeres of monocentric chromosomes. But little is known about characterization and distribution of retrotransposons within genomes with holocentric chromosomes. We will test our hypothesis that the LTR RTs in holocentrics will be evenly distributed along chromosomes. Therefore we designed the comparative study with three groups where each is composed from two holocentric plant species with different karyotypes and one phylogenetically closest monocentric relative. These species will be sequenced using: (1) Oxford Nanopore Technologies (long reads) and (2) Illumina sequencing (short reads). The primary assembly will be generated and subjected to the in-depth LTR-RTs detection analysis. Our investigation will provide information on LTR-RTs families abundance and distribution in relation to the phenomenon of holocentrism.



Project workflow

  1. Plant species selection: Holocentric plant species are chosen across phylogenetically distant taxons and concurrently within one taxon/genera pick up at least two species with remarkably different karyotype. Each higher holocentric taxon will be chosen its phylogenetically closest monocentric counterparts. Representatives of monocots and eudicots will be included (for plant species list see Table 1).
  2. Genomic DNA sequencing: We will obtain a partial and/or full-length sequences of LTR retrotransposons from all plant species and a primary genome assembly of plant subset using combination of long (Oxford Nanopore Technologies) and short (Illumina sequencing) reads.
  3. Subsequent analysis: o in-depth search and annotation of full-length LTR retrotransposons o their distribution in chromosomes (alternatively in assembled scaffolds) o identification of additional sequences carried within retroelements with possible function for the holocentric arrangement – satellites, tandem repeats, retrogenes, eORF, (micro) miRNA
  4. Based on information about abundance of LTR retrotransposon families and their distribution in our model plant species, we will describe LTR retrotransposon landscape and its relation to holocentrism.




1 Model species

Table 1. Plant species used in the study


1.1 gDNA isolation

1.1.1 NucleoSpin and Promega Wizard

gDNA isolation by J. Smerda (2020)



1.1.2 Viktor (the best) isolation

...




2 NGS sequencing

2.1 ONT MinION long read sequencing

2.1.1 Estimated genome coverage summary


2.1.2 Sequencing progress

Rapid Sequencing Kit (SQK-RAD004) was used for DNA library preparation of first five species

gDNA was isolated using Promegas’ Wizard® HMW DNA Extraction Kit (C. acutiformis, E. uniglumis and L. elegans) and/or using Viktor’s phenol/chloroform based isolation protocol (O. sativa and L. pilosa)


Nanopore ‘pass’ reads statistics


2.1.3 Comparison of ‘fast’ and ‘hac’ (high accuracy) basecallings

Fastq quality stats of S. cereale 673 fast5 files after fast and hac basecalling


2.1.4 Read quality check

The longest reads are in ‘fastq_fail’

Check on reads from O. sativa run:

O. sativa reads quality distribution



O. sativa reads quality distribution



Dividing fastq reads into qulity based bins:



O. sativa reads quality distribution



O. sativa reads quality distribution



Mapping nanopore reads to O. sativa genome assembly (Osativa_323_v7.0.fa; Ouyang et al., 2007) using minimap2

# creating index from ref:
minimap2 -x map-ont -d osat-ont.mmi Osat.fa

# input fq files:
head fqList.txt 
Osat_sub100kSeq_reads_qlt_bin_12_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_15_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_4_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_5_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_7_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_9_v2.fastq

# running minimap2:
for f in $(cat fqList.txt); do n=$(echo $f | sed 's/.fastq//g') && minimap2 -ax map-ont osat-ont.mmi $f > $n'_mapped.sam'; done

# mapping stats in brief:
samtools flagstat Osat_sub100kSeq_reads_qlt_bin_5_v2_mapped.sam
4444 + 0 in total (QC-passed reads + QC-failed reads)
446 + 0 secondary
174 + 0 supplementary
0 + 0 duplicates
1660 + 0 mapped (37.35% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


O. sativa reads quality distribution



O. sativa reads quality distribution



Running sliding windows for quality check through >100kbp long reads:


O. sativa reads quality within sliding windows (read quality mean and median, respectively)


The extra long reads over 100kbp are exclusively in bin with Q up to 4. It means that nucleotide error probability is 10^(-4/10)=0.398.

So far we do not have enough supporting arguments for using of long reads.


2.1.5 Genome assembly

Example of genome assembly, polishing, and scaffolding (Choi et al., 2020: “Nanopore sequencing-based genome assembly and evolutionary genomics of circum-basmati rice”)

"After completion of sequencing, the raw signal intensity data was used for base calling using flip flop (version 2.3.5) from Oxford Nanopore Technologies. Reads with a mean qscore (quality) greater than 8 and a read length greater than 8 kb were used and trimmed for adaptor sequences using Porechop (https://github.com/rrwick/Porechop). Raw nanopore sequencing reads were corrected using the program Canu (Koren et al., 2017) and then assembled with the genome assembler Flye (Kolmogorov et al., 2019)

The initial draft assemblies were polished for three rounds using the raw nanopore reads with Racon ver. 1.2.1 (Vaser et al., 2017) and one round with Medaka (https://github.com/nanoporetech/medaka) from Oxford Nanopore Technologies. Afterwards, reads from Illumina sequencing were used by bwa-mem (Li 2013) to align to the draft genome assemblies. The alignment files were then used by Pilon ver. 1.22 (Walker et al, 2014) for three rounds of polishing.

Contigs were scaffolded using a reference genome-guided scaffolding approach implemented in RaGOO (Alonge et al., 2019). Using the Nipponbare genome as a reference, we aligned the circum-basmati genomes using Minimap2 (Li, 2018). RaGOO was then used to order the assembly contigs. Space between contigs was artificially filled in with 100 “N” blocks.

Genome assembly statistics were calculated using the bbmap stats.sh script from the BBTools suite (https://jgi.doe.gov/data-and-tools/bbtools/)."


2.1.5.1 Pipeline

## Linking to librsvg 2.48.9

Genome assembly pipeline



2.1.5.2 Read preparation, filtering and primary assembly

Script

nano_fq_chop_filt_assembly.sh

#!/bin/bash

# get raw fq list
ls fastq_pass/FAO*fastq > fq_list.txt

# adapter trimming using porechop
mkdir 1_adaptor_trimming_porechop
for fq in $(cat fq_list.txt)
do
    n=$(echo $fq | cut -f 2 -d "/" | sed 's/.fastq/_chopped.fastq/g') 
    porechop -i $fq -o '1_adaptor_trimming_porechop/'$n
done

# gzip chopped fqs
cd 1_adaptor_trimming_porechop
cat *_chopped.fastq >> allFastq.fastq
gzip -c allFastq.fastq > allFq_chopped.fastq.gz

# nanofilt
gunzip -c allFq_chopped.fastq.gz | NanoFilt -q 7 -l 500 --headcrop 75 | gzip > all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz
cd ..

# nanoQC stats
nanoQC -o 2_nanoQC_chopped_stats 1_adaptor_trimming_porechop/all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz

# canu corrected fqs [generates 'canuCorrect.correctedReads.fasta.gz']
mkdir 3_canu_correct
cd 3_canu_correct
cp ../1_adaptor_trimming_porechop/all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz .

canu -correct -p canuCorrect -d assembly genomeSize=10m -nanopore-raw all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz
cd ..

corrFa='3_canu_correct/canuCorrect.correctedReads.fasta.gz'

# run genome assembly using 4 different tools
# 1.
mkdir 4_flye_correctedReads_assembly
cp $corrFa 4_flye_correctedReads_assembly/.
cd 4_flye_correctedReads_assembly
flye --nano-raw canuCorrect.correctedReads.fasta.gz --out-dir flye_assembly_out --threads 6
cd ..

# 2.
mkdir 5_canu_correctedReads_assembly
cp $corrFa 5_canu_correctedReads_assembly/.
cd 5_canu_correctedReads_assembly
canu -p corrReads -d assembly_corrReads genomeSize=80m -nanopore-corrected canuCorrect.correctedReads.fasta.gz
cd ..

# 3. 
mkdir 6_raven_correctedReads_assembly
cp $corrFa 6_raven_correctedReads_assembly/.
cd 6_raven_correctedReads_assembly/.
raven canuCorrect.correctedReads.fasta.gz -t 6 > raven_corrReads_assembly.fasta
cd ..

# 4.
mkdir 7_redbean_wtdbg2_correctedReads_assembly
cp $corrFa 7_redbean_wtdbg2_correctedReads_assembly/.
cd 7_redbean_wtdbg2_correctedReads_assembly
~/Documents/Soft/wtdbg2/wtdbg2.pl -t 6 -x ont -g 100m -o wtdbg2 canuCorrect.correctedReads.fasta.gz
cd ..

printf "XXX FINISHED!!! XXX"


Assembly using four different assemlers:



2.1.5.3 Assemblers comparison - O. sativa reads


2.1.5.4 Flye genome assembly comparison with Os-Nipponbare-Reference-IRGSP-1.0

MUMmer: a system for rapidly aligning entire genomes

~/Documents/Soft/mummer/nucmer -p nucmerAlign IRGSP-1.0_genome.fasta Osat_flye_assembly.fasta
~/Documents/Soft/mummer/dnadiff -d nucmerAlign.delta
~/Documents/Soft/mummer/mummerplot -l nucmerAlign.delta

# filter coordinates for Ribbon visualization
cat out.1coords | cut -f 1,2,3,4,5,6,7,8,9,12,13 > out_for_ribbon.1coords
cat out.mcoords | cut -f 1,2,3,4,5,6,7,8,9,12,13 > out_for_ribbon.mcoords

cat out.mcoords | cut -f 1,2,3,4,5,6,7,8,9,12,13 | head
1244    12042   10602   21384   10799   10783   95.93   43270923    53731   chr01   contig_4247
12145   44923   21384   53731   32779   32348   97.93   43270923    53731   chr01   contig_4247
27895   27978   58454   58537   84  84  100.00  43270923    169368  chr01   contig_1987
44125   46854   2665    1   2730    2665    90.39   43270923    139529  chr01   contig_3868
46353   54453   243025  250984  8101    7960    92.76   43270923    405317  chr01   contig_1301
50824   66551   51959   36442   15728   15518   97.01   43270923    51959   chr01   contig_4723
66383   66678   219651  219359  296 293 86.84   43270923    233972  chr01   contig_1596
67259   103694  35777   1   36436   35777   96.41   43270923    51959   chr01   contig_4723
98493   103348  37  4703    4856    4667    90.33   43270923    50883   chr01   contig_1903
118081  127998  82173   72748   9918    9426    88.06   43270923    146834  chr01   contig_1124


dotPlotly visualization

O. sativa flye assembly aligned to nipponbare IRGSP-1.0 genome - dotPlotly overview


Visualization in Ribbon

O. sativa flye assembly aligned to nipponbare IRGSP-1.0 genome - overview

O. sativa flye assembly aligned to nipponbare IRGSP-1.0 genome - zoom of long contig



2.1.5.5 O. sativa flye genome assembly comparison fast versus hac basecaller config

O. sativa query statistics of Mummer mapping to nipponbare reference



O. sativa dotPlotly alignment to nipponbare reference



2.1.6 Repeats identification in nanopore reads

te-g-nester

te-g-nester performance on O. sativa genome and our test-assemblies:

te-g-nester performance on test-assemblies from holocentric genomes:


RepeatModeler (v2.0.1) and RepeatMasker (v4.1.1) are run on all Flye assembly fasta sequences Metacentrum offer module with repeatmodeler-1.0.11 version without specific LTR retrotransposon determination

RepeatModeler2 flow diagram


2.1.6.1 RepeatMasker is called immediatelly after RepeatModeler2

#!/bin/bash

# input fasta
fa="canuCorrect.correctedReads.fasta"

# RepeatModeler - build DB
~/Documents/Soft/RepeatModeler-2.0.1/BuildDatabase -name db $fa

# RepeatModeler - run
nohup ~/Documents/Soft/RepeatModeler-2.0.1/RepeatModeler -database db -pa 8 -LTRStruct &> run.out

# Annotated species-specific repeat library
consFa=$(find -name "consensi.fa.classified")

# RepeatMasker - run
RepeatMasker -pa 8 -engine rmblast -gff -lib $consFa $fa


Repetitive DNA detected using RepeatModeler-RepeatMasker in Flye assemblies of all species

2.1.7 Prediction of TE-specific DNA modification using modPhred

2.1.7.1 modPhred (Pryszcz & Novoa, 2021)

DNA or RNA modification probabilities information from basecalled FAST5 datasets strong correlation between in fast5 detected modification and bisulphite methylation from NGS sequencing

Nanopore/Bisulphite methylation frequency correlation (from Modified Base Tutorial)

Nanopore/Bisulphite methylation frequency correlation


2.1.7.2 modPhred pipeline:

guppy_basecaller on ‘fast5_pass’ data using dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg modPhred ~/src/modPhred/run -f ref/Os-Nipponbare-Reference-IRGSP-1.0_GCF_001433935.1.fna -o modPhred/ -i workspace/ -t6 ~/src/modPhred/src/mod_plot.py --scatter -i modPhred/mod.gz ~/src/modPhred/src/mod_correlation.py -i modPhred/mod.gz

Schematic overview of the functionalities of each of the 4 modules included in modPhred: (i) modEncode, (ii) modAlign, (iii) modReport and (iv) modAnalysis.


2.1.7.3 Run guppy_basecaller on Metacentum

My storage path and address for scp:

pavel@astat:/mnt/hdd_4TB/MinION_outputs/Osat_1/CTAB_Viktor_1/20200917_0715_MN34570_FAO01582_9d0828e0$ scp osat_fast5_pass.tar.gz storage-du-cesnet.metacentrum.cz:/tape_tape/archive/VO_metacentrum/home/pj_narutro/modPhred/osat_modPhred/.
pavel@debian-10-buster:~$ ssh pj_narutro@radegast.ueb.cas.cz
(BUSTER)pj_narutro@radegast:~$ cd /storage/du-cesnet/tape_tape/archive/VO_metacentrum/home/pj_narutro/modPhred/ecol_modPhred_test/


Scratch job script:

scratchJob_guppyGpu_441_damDcmCpg_lpil.sh


Run a scratch job:

(BUSTER)pj_narutro@radegast:~$ qsub -q gpu -l select=1:ncpus=8:ngpus=1:mem=12gb:scratch_local=10gb scratchJob_guppyGpu_441_damDcmCpg_lpil.sh


2.1.7.4 modPhred run on astat

Reference fasta:

  • publicly available genome assembly
  • de novo assembly
  • chromosome or its part specified by start-end positions
  • nanopore canu-corrected reads

~/src/modPhred/run -f ref/Os-Nipponbare-Reference-IRGSP-1.0_GCF_001433935.1.fna -o modPhred/ -i workspace/ -t6 ~/src/modPhred/src/mod_plot.py --scatter -i modPhred/mod.gz ~/src/modPhred/src/mod_correlation.py -i modPhred/mod.gz

outputs

$ tree modPhred/
modPhred/
├── correlations
│   ├── NC_029256.1.csv.gz
│   ├── NC_029256.1.csv.gz.svg
│   ├── NC_029257.1.csv.gz
│   ├── NC_029257.1.csv.gz.svg
│   ├── NC_029258.1.csv.gz
│   ├── NC_029258.1.csv.gz.svg
│   ├── NC_029259.1.csv.gz
│   ├── NC_029259.1.csv.gz.svg
│   ├── NC_029260.1.csv.gz
│   ├── NC_029260.1.csv.gz.svg
│   ├── NC_029263.1.csv.gz
│   └── NC_029263.1.csv.gz.svg
├── minimap2
│   ├── workspace.bam
│   ├── workspace.bam.bai
│   ├── workspace.bam.bed
│   └── workspace.log
├── mod.bed
├── mod.gz
├── mod.gz.svg
├── modPhred.pkl
├── osatModPhred_igv_snapshot1.png
├── osatModPhred_igv_snapshot2.png
├── osatModPhred_igv_snapshot3.png
└── plots
    ├── basecall_accuracy.svg
    ├── depth.svg
    ├── median_mod_prob.svg
    └── mod_frequency.svg

3 directories, 27 files

###

$ head mod.bed 
NC_029256.1 2828345 2828346 5mC 967 +   2828345 2828346 0,0,151 27  59
NC_029256.1 2828401 2828402 5mC 850 +   2828401 2828402 0,0,59  26  23
NC_029256.1 2828464 2828465 5mC 967 +   2828464 2828465 0,0,186 26  73
NC_029256.1 2828494 2828495 5mC 817 +   2828494 2828495 0,0,41  25  16
NC_029256.1 2828590 2828591 5mC 1000    +   2828590 2828591 0,0,128 28  50
NC_029256.1 2828597 2828598 5mC 583 +   2828597 2828598 0,0,39  26  15
NC_029256.1 2828626 2828627 5mC 1000    +   2828626 2828627 0,0,235 25  92
NC_029256.1 2828633 2828634 5mC 800 +   2828633 2828634 0,0,104 27  41
NC_029256.1 2828637 2828638 5mC 833 +   2828637 2828638 0,0,143 25  56
NC_029256.1 2828652 2828653 5mC 800 +   2828652 2828653 0,0,61  25  24

$ cut -f 4 mod.bed | sort | uniq -c
  38873 5mC
    385 6mA


Mapping coverage of O. sativa genome Nipponbare IRGSP-1.0; GCF_001433935.1 using minimap2


QC plots

modPhred QC plots


IGV visualization

O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0

O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0


O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0 (zoom in)

O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0 (zoom in)


Modifications in co-ocurrence heatmap

O. sativa, chromosome 10 and 11 modifications heatmap.

Comparison of nanopore predicted dna methylations with Illumina data from bisulfite conversion

Bisulfite NGS data processing using Bismark tool

I have obtained over 5M uniq methylated bases accross O. sativa genome (n=5114321; CpG, CHG, CHH) modPhred detected 38873 (5mC) + 385 (6mA) = 39258; i.e. ~ 130.3x lower detection.

# SRAfile download and extraction
$ wget https://sra-download.ncbi.nlm.nih.gov/traces/sra44/SRR/012798/SRR13105618
$ ~/Documents/Soft/sratoolkit.2.4.1-ubuntu64/bin/fastq-dump --split-files SRR13105618

# fastq files subsetting
$ seqtk sample -s100 SRR13105618_1.fastq 1000000 > SRR13105618_1M_subset_1.fastq
$ seqtk sample -s100 SRR13105618_2.fastq 1000000 > SRR13105618_1M_subset_2.fastq

# run Bismark perl script
$ ~/Documents/Soft/Bismark/bismark_genome_preparation --path_to_aligner /usr/bin/ --verbose ref/

$ ~/Documents/Soft/Bismark/bismark --parallel 4 --genome ref/ -1 SRR13105618_1M_subset_1.fastq -2 SRR13105618_1M_subset_2.fastq 

$ ~/Documents/Soft/Bismark/deduplicate_bismark --bam SRR13105618_1M_subset_1_bismark_bt2_pe.bam

$ ~/Documents/Soft/Bismark/bismark_methylation_extractor --gzip --bedGraph SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.bam

$ ~/Documents/Soft/Bismark/bismark2report

$ tree bismark_methylation_extractor_outputs/
bismark_methylation_extractor_outputs/
├── CHG_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CHG_OT_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CHH_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CHH_OT_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CpG_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CpG_OT_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── extrList.txt
├── osat_bisulfite_methBases_forR.tsv
└── osat_bisulfite_methBases_subsetDiv250_forR.tsv

$ head bismark_methylation_extractor_outputs/CHG_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
Bismark methylation extractor version v0.23.0
SRR13105618.93013374_93013374_length=150    +   NC_029260.1 11570272    X
SRR13105618.93013374_93013374_length=150    +   NC_029260.1 11570265    X
SRR13105618.93013374_93013374_length=150    -   NC_029260.1 11570250    x
SRR13105618.93013374_93013374_length=150    -   NC_029260.1 11570247    x
SRR13105618.93013374_93013374_length=150    +   NC_029260.1 11570234    X
SRR13105618.93013374_93013374_length=150    +   NC_029260.1 11570219    X
SRR13105618.93013374_93013374_length=150    +   NC_029260.1 11570213    X
SRR13105618.93013374_93013374_length=150    -   NC_029260.1 11570210    x
SRR13105618.93013374_93013374_length=150    -   NC_029260.1 11570191    x

$ for f in $(cat extrList.txt); do pref=$(echo $f |cut -f 1 -d 'S'| cut -f 1,2 -d "_") && grep -v "Bismark\|NW" $f | grep -P "\t[HXZ]" | cut -f 3,4 | sort | uniq | sed -r "s/$/\t$pref\tbisulfite/g"; done >> osat_bisulfite_methBases_forR.tsv

$ head bismark_methylation_extractor_outputs/osat_bisulfite_methBases_forR.tsv 
NC_011033.1 212370  CHG_OB  bisulfite
NC_011033.1 213507  CHG_OB  bisulfite
NC_011033.1 228292  CHG_OB  bisulfite
NC_011033.1 228410  CHG_OB  bisulfite
NC_011033.1 240374  CHG_OB  bisulfite
NC_011033.1 240381  CHG_OB  bisulfite
NC_011033.1 240424  CHG_OB  bisulfite
NC_011033.1 240440  CHG_OB  bisulfite
NC_011033.1 240450  CHG_OB  bisulfite
NC_011033.1 240463  CHG_OB  bisulfite

$ wc -l bismark_methylation_extractor_outputs/osat_bisulfite_methBases_forR.tsv 
5114321 bismark_methylation_extractor_outputs/osat_bisulfite_methBases_forR.tsv

$ python3 modPhre_bisulfite_intersects.py mod.bed osat_bisulfite_methBases_forR.tsv 
mp_cnt_tot  39258
mp_cnt_match    7009
mp_not_in_bs    277


Bisulfite sequencing (CpG, CHG, CHH) versus ONT (5mC) + modPred methylation detection in O. sativa.
chromosome bisulfite_cnts modPhred_cnts Osat_chrLen bisulfite_perc modPhred_perc Osat_chrLenPerc
NC_011033.1 366 3554 490520 0.01 9.19 0.14
NC_029256.1 502991 2889 43270923 10.76 7.47 12.63
NC_029257.1 438919 884 35937250 9.39 2.29 10.49
NC_029258.1 441648 5834 36413819 9.45 15.09 10.63
NC_029259.1 528138 205 35502694 11.29 0.53 10.37
NC_029260.1 477522 172 29958434 10.21 0.44 8.75
NC_029262.1 431659 4 29697621 9.23 0.01 8.67
NC_029263.1 336324 443 28443022 7.19 1.15 8.30
NC_029264.1 365707 10282 23012720 7.82 26.60 6.72
NC_029265.1 358156 3297 23207287 7.66 8.53 6.78
NC_029266.1 383293 6019 29021106 8.20 15.57 8.47
NC_029267.1 411270 5070 27531856 8.80 13.12 8.04



2.1.8 Comparison of DNA modification in repeats of holo/monocentric plant genomes

2.1.8.1 te-g-nester outputs and modPhred predictions

> Methylated base positions within TEs




2.2 Illumina miSeq short read sequencing


2.2.1 Genomic DNA samples


2.2.2 Run info


2.2.3 MultiQC basic stats

Genome coverage

Coverage Equation The Lander/Waterman equation is a method for computing coverage.

The general equation is:

C = LN / G

  • C stands for coverage
  • G is the haploid genome length
  • L is the read length
  • N is the number of reads

So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×106)/(3×109 bp) = 6.3 This tells us that each base in the genome will be sequenced between six and seven times on average.




3 Repetitive DNA overview - RepeatExplorer

Genomic DNA of all plant species were sequenced using Illumina miSeq. Load of gDNA were adjusted proportionally to their genome size.

3.1 Run analysis on Galaxy server

RepExp Galaxy server link.

3.1.1 Preprocessing of FASTQ paired-end reads

  • Sample size (number of pairs): 1000000
  • Trim reads 1 to 100 position
  • Rename reads

3.1.2 RepeatExplorer2 clustering

  • Paired ends
  • No reads sampling
  • No comparative analysis

3.2 RepExp outputs analysis

3.2.1 Parsing CLUSTER_TABLE.csv outputs




3.2.2 Visualization in R

RepExp outputed repeats were divided into:

  • All annotated
  • Unidentified
  • Class I
  • Class II
  • rDNA with satellites
  • Ogranelles - mitochodrias and plastids





Figure 1.



Figure 2.



Figure 3.



Figure 4.



Figure 5.



Figure 6.



3.2.3 RepeatExplorer - comparative analysis

300000 read pairs per plant 221bp read length for all reads

3.2.3.1 Comparative analysis summary

RE - comparative analysis summary


3.2.3.2 Proportion of annotated repeats

Repeat proportion


3.2.3.3 Principal component analysis

> head(t_df_noZero)
             LPIL LELE OSAT SCER CACU CPAP EUNI DCAP DFIL NVEN DLUS RACE RALP CLUT PQUA
1|10|plastid 3294  441   58   43   36  122   22  141   43  109   33   33  255  265   22
1|11|plastid    9    4   58   48    2    3    0  671  212  628  220  154 1385 1357  164
1|16|plastid  821  107  135   67  130  323   40  568  172  240   81   74  695  523   61
1|18|plastid  139   22   34   23  675  256   14 1534  559   78   30   19  162  152   15
1|21|plastid  703   79  107   68   82  249   22  286   87  253   64   83  591  504   83
1|23|plastid  717   99   97   85  124  260   39  132   53  239   92   67  555  498   80

PCA analysis - all of 268 superclusters

no plastids

Plastids

All

Angela

Athila

SIRE

Tekay

rDNA